Search for: All records

Creators/Authors contains: "Iyer, Ravishankar K."

AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems

Qiu, Haoran ; Mao, Weichao ; Wang, Chen ; Franke, Hubertus ; Youssef, Alaa ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( July 2023 , 2023 USENIX Annual Technical Conference (USENIX ATC 23))

Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. Reinforcement learning (RL) has recently been proposed and applied in various systems tasks, including resource management. In this paper, we first characterize the state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that there is still a large gap in taking the RL advances to production systems. We then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads, and (b) provide safe and robust RL exploration. AWARE provides a common OpenAI Gym-like RL interface to agent developers for easy integration with different systems tasks. We illustrate the use of AWARE in the case of workload autoscaling. Our experiments show that AWARE adapts a learned autoscaling policy to new workloads 5.5x faster than the existing transfer-learning-based approach and provides stable online policy-serving performance with less than 3.6% reward degradation. With bootstrapping, AWARE helps achieve 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9x during policy training.
Free, publicly-accessible full text available July 1, 2024
SIMPPO: a scalable and incremental online learning framework for serverless resource management

https://doi.org/10.1145/3542929.3563475

Qiu, Haoran ; Mao, Weichao ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Başar, Tamer ; Iyer, Ravishankar K. ( November 2022 , Proceedings of the 13th ACM Symposium on Cloud Computing (SoCC 2022))

Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-“less” and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To maintain function service-level objectives (SLOs) and improve resource utilization efficiency, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to rule-based solutions with heuristics, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. Despite the initial success of applying RL, we first show in this paper that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.8x higher p99 function latency degradation on multi-tenant serverless FaaS platforms compared to isolated environments and is unable to converge during training. We then design and implement a scalable and incremental multi-agent RL framework based on Proximal Policy Optimization (SIMPPO). Our experiments on widely used serverless benchmarks demonstrate that in multi-tenant environments, SIMPPO enables each RL agent to efficiently converge during training and provides online function latency performance comparable to that of S-RL trained in isolation (which we refer to as the baseline for assessing RL performance) with minor degradation (<9.2%). In addition, SIMPPO reduces the p99 function latency by 4.5x compared to S-RL in multi-tenant cases.
Full Text Available
Reinforcement learning for resource management in multi-tenant serverless platforms

https://doi.org/10.1145/3517207.3526971

Qiu, Haoran ; Mao, Weichao ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Başar, Tamer ; Iyer, Ravishankar K. ( April 2022 , EuroMLSys 2022 - Proceedings of the 2nd European Workshop on Machine Learning and Systems)

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases with less than 10% degradation. Besides, MA-PPO provides a 4.4x improvement in S-RL performance (in terms of function tail latency) in multi-tenant cases.
Full Text Available
Is Function-as-a-Service a Good Fit for Latency-Critical Services?

https://doi.org/10.1145/3493651.3493666

Qiu, Haoran ; Jha, Saurabh ; Banerjee, Subho S. ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( December 2021 , WoSC '21: Proceedings of the Seventh International Workshop on Serverless Computing (WoSC7) 2021)

Function-as-a-Service (FaaS) is becoming an increasingly popular cloud-deployment paradigm for serverless computing that frees application developers from managing the infrastructure. At the same time, it allows cloud providers to assert control in workload consolidation, i.e., co-locating multiple containers on the same server, thereby achieving higher server utilization, often at the cost of higher end-to-end function request latency. Interestingly, a key aspect of serverless latency management has not been well studied: the trade-off between application developers' latency goals and the FaaS providers' utilization goals. This paper presents a multi-faceted, measurement-driven study of latency variation in serverless platforms that elucidates this trade-off space. We obtained production measurements by executing FaaS benchmarks on IBM Cloud and a private cloud to study the impact of workload consolidation, queuing delay, and cold starts on the end-to-end function request latency. We draw several conclusions from the characterization results. For example, increasing a container's allocated memory limit from 128 MB to 256 MB reduces the tail latency by 2× but has 1.75× higher power consumption and 59% lower CPU utilization.
Full Text Available
Individualized Seizure Cluster Prediction Using Machine Learning and Chronic Ambulatory Intracranial EEG

https://doi.org/10.1109/TNB.2023.3275037

Saboo, Krishnakant V. ; Cao, Yurui ; Kremen, Vaclav ; Sladky, Vladimir ; Gregg, Nicholas M. ; Arnold, Paul M. ; Karoly, Philippa J. ; Freestone, Dean R. ; Cook, Mark J. ; Worrell, Gregory A. ; et al ( October 2023 , IEEE Transactions on NanoBioscience)
BayesPerf: minimizing performance monitoring errors using Bayesian statistics

https://doi.org/10.1145/3445814.3446739

Banerjee, Subho S. ; Jha, Saurabh ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar K. ( April 2021 , The 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21))
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling of PCIe transfers.
Full Text Available
Delay sensitivity-driven congestion mitigation for HPC systems

https://doi.org/10.1145/3447818.3460362

Patke, Archit ; Jha, Saurabh ; Qiu, Haoran ; Brandt, Jim ; Gentile, Ann ; Greenseid, Joe ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar K. ( June 2021 , Proceedings of The 35th ACM International Conference on Supercomputing (ICS ’21))
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.
Full Text Available
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Qiu, Haoran ; Banerjee, Subho S. ; Jha, Saurabh ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( November 2020 , Proceedings of The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20))
User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16× while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11×.
Full Text Available
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems

https://doi.org/10.5555/3433701.3433787

Jha, Saurabh ; Cui, Shengkun ; Banerjee, Subho ; Xu, Tianyin ; Enos, Jeremy ; Showerman, Mike ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( November 2020 , Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC 2020))
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
Full Text Available
Prediction of short-term antidepressant response using probabilistic graphical models with replication across multiple drugs and treatment settings

https://doi.org/10.1038/s41386-020-00943-x

Athreya, Arjun P. ; Brückl, Tanja ; Binder, Elisabeth B. ; John Rush, A. ; Biernacka, Joanna ; Frye, Mark A. ; Neavin, Drew ; Skime, Michelle ; Monrad, Ditlev ; Iyer, Ravishankar K. ; et al ( June 2021 , Neuropsychopharmacology)
Heterogeneity in the clinical presentation of major depressive disorder and response to antidepressants limits clinicians’ ability to accurately predict a specific patient’s eventual response to therapy. Validated depressive symptom profiles may be an important tool for identifying poor outcomes early in the course of treatment. To derive these symptom profiles, we first examined data from 947 depressed subjects treated with selective serotonin reuptake inhibitors (SSRIs) to delineate the heterogeneity of antidepressant response using probabilistic graphical models (PGMs). We then used unsupervised machine learning to identify specific depressive symptoms and thresholds of improvement that were predictive of antidepressant response by 4 weeks for a patient to achieve remission, response, or nonresponse by 8 weeks. Four depressive symptoms (depressed mood, guilt feelings and delusion, work and activities, and psychic anxiety) and specific thresholds of change in each at 4 weeks predicted eventual outcome at 8 weeks to SSRI therapy with an average accuracy of 77% (p = 5.5E-08). The same four symptoms and prognostic thresholds derived from patients treated with SSRIs correctly predicted outcomes in 72% (p = 1.25E-05) of 1996 patients treated with other antidepressants in both inpatient and outpatient settings in independent publicly-available datasets. These predictive accuracies were higher than the accuracy of 53% for predicting SSRI response achieved using approaches that (i) incorporated only baseline clinical and sociodemographic factors, or (ii) used 4-week nonresponse status to predict likely outcomes at 8 weeks. The present findings suggest that PGMs providing interpretable predictions have the potential to enhance clinical treatment of depression and reduce the time burden associated with trials of ineffective antidepressants. Prospective trials examining this approach are forthcoming.
Full Text Available
